Template Credit: Adapted from a template made available by Dr. Jason Brownlee of Machine Learning Mastery. [http://machinelearningmastery.com/]

SUMMARY: The purpose of this project is to construct a predictive model using various machine learning algorithms and to document the end-to-end steps using a template. The Sensorless Drive Diagnosis dataset presents a multi-class classification problem where we are trying to predict one of several possible outcomes.

INTRODUCTION: The dataset contains features extracted from electric current drive signals. The drive has both intact and defective components. The signals fall into 11 different classes representing different drive conditions. Each condition has been measured several times under 12 different operating conditions, such as speeds, load moments, and load forces.

In iteration Take1, we established the baseline accuracy measurement for comparison with future rounds of modeling.

In this iteration, we will standardize the numeric attributes and observe the impact of scaling on modeling accuracy.
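The standardization step can be sketched with caret's preProcess function. This is a minimal sketch on a toy data frame; `df` and its columns are hypothetical stand-ins for the attribute columns of the actual dataset:

```r
library(caret)

# Toy attribute frame standing in for the numeric attributes (hypothetical data)
df <- data.frame(attr1 = c(1, 2, 3, 4), attr2 = c(10, 20, 30, 40))

# Learn the centering and scaling parameters from the data...
pp <- preProcess(df, method = c("center", "scale"))

# ...then apply them, yielding columns with mean 0 and standard deviation 1
df_std <- predict(pp, df)

colMeans(df_std)      # each mean is 0 (within floating-point error)
sapply(df_std, sd)    # each standard deviation is 1
```

In a caret workflow, the same effect can also be achieved by passing `preProcess = c("center", "scale")` to `train()`, which learns the scaling parameters inside each cross-validation fold.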

ANALYSIS: In iteration Take1, the baseline performance of the machine learning algorithms achieved an average accuracy of 85.53%. Two algorithms (Random Forest and Gradient Boosting) achieved the top accuracy metrics after the first round of modeling. After a series of tuning trials, Random Forest turned in the top overall result and achieved an accuracy metric of 99.92%. After applying the optimized parameters, the Random Forest algorithm processed the testing dataset with an accuracy of 99.90%, which was on par with the accuracy estimated from the training data.

In this iteration, the baseline performance of the machine learning algorithms achieved an average accuracy of 85.34%. Two algorithms (Random Forest and Gradient Boosting) achieved the top accuracy metrics after the first round of modeling. After a series of tuning trials, Random Forest turned in the top overall result and achieved an accuracy metric of 99.92%. After applying the optimized parameters, the Random Forest algorithm processed the testing dataset with an accuracy of 99.90%, which was on par with the accuracy estimated from the training data.

After standardizing the dataset features, the ensemble algorithms continued to perform well. However, standardization appeared to have little impact on the overall modeling accuracy.

CONCLUSION: For this iteration, the Random Forest algorithm achieved the best overall training and validation results. For this dataset, Random Forest could be considered for further modeling.

Dataset Used: Sensorless Drive Diagnosis Data Set

Dataset ML Model: Multi-class classification with numerical attributes

Dataset Reference: https://archive.ics.uci.edu/ml/datasets/Dataset+for+Sensorless+Drive+Diagnosis

The project aims to touch on the following areas:

  1. Document a predictive modeling problem end-to-end.
  2. Explore data cleaning and transformation options.
  3. Explore non-ensemble and ensemble algorithms for baseline model performance.
  4. Explore algorithm tuning techniques for improving model performance.

Any predictive modeling machine learning project can generally be broken down into six major tasks:

  1. Prepare Environment
  2. Summarize Data
  3. Prepare Data
  4. Model and Evaluate Algorithms
  5. Improve Accuracy or Results
  6. Finalize Model and Present Results

1. Prepare Environment

1.a) Load libraries and packages

startTimeScript <- proc.time()
library(caret)
## Loading required package: lattice
## Loading required package: ggplot2
library(corrplot)
## corrplot 0.84 loaded
library(DMwR)
## Loading required package: grid
## Registered S3 method overwritten by 'xts':
##   method     from
##   as.zoo.xts zoo
## Registered S3 method overwritten by 'quantmod':
##   method            from
##   as.zoo.data.frame zoo
library(Hmisc)
## Loading required package: survival
## 
## Attaching package: 'survival'
## The following object is masked from 'package:caret':
## 
##     cluster
## Loading required package: Formula
## 
## Attaching package: 'Hmisc'
## The following objects are masked from 'package:base':
## 
##     format.pval, units
library(ROCR)
## Loading required package: gplots
## 
## Attaching package: 'gplots'
## The following object is masked from 'package:stats':
## 
##     lowess
library(stringr)

1.b) Set up the controlling parameters and functions

# Create the random seed number for reproducible results
seedNum <- 888

# Set up the notifyStatus flag to control progress emails (setting it to TRUE will send status emails!)
notifyStatus <- TRUE
if (notifyStatus) library(mailR)
## Registered S3 method overwritten by 'R.oo':
##   method        from       
##   throw.default R.methodsS3
# Run algorithms using 10-fold cross validation
control <- trainControl(method="repeatedcv", number=10, repeats=1)
metricTarget <- "Accuracy"
# Set up the email notification function
email_notify <- function(msg=""){
  sender <- Sys.getenv("MAIL_SENDER")
  receiver <- Sys.getenv("MAIL_RECEIVER")
  gateway <- Sys.getenv("SMTP_GATEWAY")
  smtpuser <- Sys.getenv("SMTP_USERNAME")
  password <- Sys.getenv("SMTP_PASSWORD")
  sbj_line <- "Notification from R Multi-Class Classification Script"
  send.mail(
    from = sender,
    to = receiver,
    subject= sbj_line,
    body = msg,
    smtp = list(host.name = gateway, port = 587, user.name = smtpuser, passwd = password, ssl = TRUE),
    authenticate = TRUE,
    send = TRUE)
}
if (notifyStatus) email_notify(paste("Library and Data Loading has begun!",date()))
## [1] "Java-Object{org.apache.commons.mail.SimpleEmail@3b6eb2ec}"

1.c) Load dataset

# Slicing up the document path to get the final destination file name
dataset_path <- 'https://archive.ics.uci.edu/ml/machine-learning-databases/00325/Sensorless_drive_diagnosis.txt'
doc_path_list <- str_split(dataset_path, "/")
dest_file <- doc_path_list[[1]][length(doc_path_list[[1]])]

if (!file.exists(dest_file)) {
  # Download the document from the website
  cat("Downloading", dataset_path, "as", dest_file, "\n")
  download.file(dataset_path, dest_file, mode = "wb")
  cat(dest_file, "downloaded!\n")
#  unzip(dest_file)
#  cat(dest_file, "unpacked!\n")
}

inputFile <- dest_file
colNames <- paste0("attr",1:48)
colNames <- c(colNames, 'targetVar')
Xy_original <- read.csv(inputFile, sep=' ', header=FALSE, col.names = colNames)
# Take a peek at the dataframe after the import
head(Xy_original)
##         attr1       attr2       attr3       attr4       attr5       attr6
## 1 -3.0146e-07  8.2603e-06 -1.1517e-05 -2.3098e-06 -1.4386e-06 -2.1225e-05
## 2  2.9132e-06 -5.2477e-06  3.3421e-06 -6.0561e-06  2.7789e-06 -3.7524e-06
## 3 -2.9517e-06 -3.1840e-06 -1.5920e-05 -1.2084e-06 -1.5753e-06  1.7394e-05
## 4 -1.3226e-06  8.8201e-06 -1.5879e-05 -4.8111e-06 -7.2829e-07  4.1439e-06
## 5 -6.8366e-08  5.6663e-07 -2.5906e-05 -6.4901e-06 -7.9406e-07  1.3491e-05
## 6 -9.5849e-07  5.2143e-08 -4.7359e-05  6.4537e-07 -2.3041e-06  5.4999e-05
##      attr7    attr8    attr9    attr10    attr11    attr12     attr13
## 1 0.031718 0.031710 0.031721 -0.032963 -0.032962 -0.032941 0.00076881
## 2 0.030804 0.030810 0.030806 -0.033520 -0.033522 -0.033519 0.00076614
## 3 0.032877 0.032880 0.032896 -0.029834 -0.029832 -0.029849 0.00076385
## 4 0.029410 0.029401 0.029417 -0.030156 -0.030155 -0.030159 0.00076950
## 5 0.030119 0.030119 0.030145 -0.031393 -0.031392 -0.031405 0.00076335
## 6 0.031154 0.031154 0.031201 -0.032789 -0.032787 -0.032842 0.00076713
##       attr14     attr15     attr16     attr17     attr18  attr19  attr20
## 1 0.00023244 0.00059982 0.00075698 0.00024722 0.00072498 0.89669 0.89669
## 2 0.00022071 0.00048534 0.00075479 0.00025208 0.00066780 0.89583 0.89583
## 3 0.00022992 0.00056024 0.00075789 0.00023620 0.00071163 0.89583 0.89583
## 4 0.00024423 0.00075301 0.00075545 0.00025668 0.00075448 0.89480 0.89481
## 5 0.00024924 0.00062287 0.00075629 0.00022513 0.00061220 0.89656 0.89656
## 6 0.00025203 0.00064273 0.00075793 0.00026632 0.00073583 0.89458 0.89458
##    attr21  attr22  attr23  attr24      attr25    attr26   attr27
## 1 0.89669 0.89658 0.89658 0.89656  0.00768040  0.257360 -0.71184
## 2 0.89580 0.89677 0.89677 0.89673 -0.00940220 -0.059481 -0.29592
## 3 0.89581 0.89619 0.89619 0.89621  0.00595100 -0.075239 -0.22862
## 4 0.89479 0.89576 0.89576 0.89572  0.00205630  0.466570  0.56841
## 5 0.89655 0.89521 0.89520 0.89520 -0.00086017 -0.904870 -0.57395
## 6 0.89455 0.89573 0.89572 0.89572  0.00048305  0.164030 -0.13124
##       attr28    attr29   attr30     attr31     attr32     attr33
## 1 0.00487890 -0.095775 -0.44126 -0.0013168 -0.0013189 -0.0012477
## 2 0.00711140  0.119110  0.31117  0.0010932  0.0010911  0.0010682
## 3 0.00044468 -0.162300  0.56210  0.0028942  0.0029030  0.0028851
## 4 0.00693590 -0.467240  0.22673 -0.0012546 -0.0012421 -0.0012774
## 5 0.00561650  0.343380  0.84307 -0.0038112 -0.0038040 -0.0038000
## 6 0.00129750 -0.048862  2.21850 -0.0028981 -0.0028984 -0.0028680
##        attr34      attr35      attr36   attr37  attr38  attr39   attr40
## 1 -0.00437770 -0.00438410 -0.00438930 -0.66732  4.3662  6.0168 -0.63308
## 2 -0.00134400 -0.00134190 -0.00137550 -0.65404  1.3977  3.6048 -0.59314
## 3  0.00035014  0.00035803  0.00037366 -0.67146  2.8072  5.8007 -0.63252
## 4 -0.00497380 -0.00496550 -0.00497560 -0.67766  7.8629 23.3960 -0.62289
## 5 -0.00465540 -0.00464600 -0.00463950 -0.65867 14.8720  5.0582 -0.63010
## 6 -0.00151920 -0.00151880 -0.00140970 -0.65298  7.3162  3.9757 -0.61124
##   attr41  attr42  attr43  attr44  attr45  attr46  attr47  attr48 targetVar
## 1 2.9646  8.1198 -1.4961 -1.4961 -1.4961 -1.4996 -1.4996 -1.4996         1
## 2 7.6252  6.1690 -1.4967 -1.4967 -1.4967 -1.5005 -1.5005 -1.5005         1
## 3 2.7784  5.3017 -1.4983 -1.4983 -1.4982 -1.4985 -1.4985 -1.4985         1
## 4 6.5534  6.2606 -1.4963 -1.4963 -1.4963 -1.4975 -1.4975 -1.4976         1
## 5 4.5155  9.5231 -1.4958 -1.4958 -1.4958 -1.4959 -1.4959 -1.4959         1
## 6 5.8337 18.6970 -1.4956 -1.4956 -1.4956 -1.4973 -1.4972 -1.4973         1
sapply(Xy_original, class)
##     attr1     attr2     attr3     attr4     attr5     attr6     attr7 
## "numeric" "numeric" "numeric" "numeric" "numeric" "numeric" "numeric" 
##     attr8     attr9    attr10    attr11    attr12    attr13    attr14 
## "numeric" "numeric" "numeric" "numeric" "numeric" "numeric" "numeric" 
##    attr15    attr16    attr17    attr18    attr19    attr20    attr21 
## "numeric" "numeric" "numeric" "numeric" "numeric" "numeric" "numeric" 
##    attr22    attr23    attr24    attr25    attr26    attr27    attr28 
## "numeric" "numeric" "numeric" "numeric" "numeric" "numeric" "numeric" 
##    attr29    attr30    attr31    attr32    attr33    attr34    attr35 
## "numeric" "numeric" "numeric" "numeric" "numeric" "numeric" "numeric" 
##    attr36    attr37    attr38    attr39    attr40    attr41    attr42 
## "numeric" "numeric" "numeric" "numeric" "numeric" "numeric" "numeric" 
##    attr43    attr44    attr45    attr46    attr47    attr48 targetVar 
## "numeric" "numeric" "numeric" "numeric" "numeric" "numeric" "integer"
sapply(Xy_original, function(x) sum(is.na(x)))
##     attr1     attr2     attr3     attr4     attr5     attr6     attr7 
##         0         0         0         0         0         0         0 
##     attr8     attr9    attr10    attr11    attr12    attr13    attr14 
##         0         0         0         0         0         0         0 
##    attr15    attr16    attr17    attr18    attr19    attr20    attr21 
##         0         0         0         0         0         0         0 
##    attr22    attr23    attr24    attr25    attr26    attr27    attr28 
##         0         0         0         0         0         0         0 
##    attr29    attr30    attr31    attr32    attr33    attr34    attr35 
##         0         0         0         0         0         0         0 
##    attr36    attr37    attr38    attr39    attr40    attr41    attr42 
##         0         0         0         0         0         0         0 
##    attr43    attr44    attr45    attr46    attr47    attr48 targetVar 
##         0         0         0         0         0         0         0

1.d) Data Cleaning

# Convert columns from one data type to another
Xy_original$targetVar <- as.factor(Xy_original$targetVar)
# Take a peek at the dataframe after the cleaning
head(Xy_original)
##         attr1       attr2       attr3       attr4       attr5       attr6
## 1 -3.0146e-07  8.2603e-06 -1.1517e-05 -2.3098e-06 -1.4386e-06 -2.1225e-05
## 2  2.9132e-06 -5.2477e-06  3.3421e-06 -6.0561e-06  2.7789e-06 -3.7524e-06
## 3 -2.9517e-06 -3.1840e-06 -1.5920e-05 -1.2084e-06 -1.5753e-06  1.7394e-05
## 4 -1.3226e-06  8.8201e-06 -1.5879e-05 -4.8111e-06 -7.2829e-07  4.1439e-06
## 5 -6.8366e-08  5.6663e-07 -2.5906e-05 -6.4901e-06 -7.9406e-07  1.3491e-05
## 6 -9.5849e-07  5.2143e-08 -4.7359e-05  6.4537e-07 -2.3041e-06  5.4999e-05
##      attr7    attr8    attr9    attr10    attr11    attr12     attr13
## 1 0.031718 0.031710 0.031721 -0.032963 -0.032962 -0.032941 0.00076881
## 2 0.030804 0.030810 0.030806 -0.033520 -0.033522 -0.033519 0.00076614
## 3 0.032877 0.032880 0.032896 -0.029834 -0.029832 -0.029849 0.00076385
## 4 0.029410 0.029401 0.029417 -0.030156 -0.030155 -0.030159 0.00076950
## 5 0.030119 0.030119 0.030145 -0.031393 -0.031392 -0.031405 0.00076335
## 6 0.031154 0.031154 0.031201 -0.032789 -0.032787 -0.032842 0.00076713
##       attr14     attr15     attr16     attr17     attr18  attr19  attr20
## 1 0.00023244 0.00059982 0.00075698 0.00024722 0.00072498 0.89669 0.89669
## 2 0.00022071 0.00048534 0.00075479 0.00025208 0.00066780 0.89583 0.89583
## 3 0.00022992 0.00056024 0.00075789 0.00023620 0.00071163 0.89583 0.89583
## 4 0.00024423 0.00075301 0.00075545 0.00025668 0.00075448 0.89480 0.89481
## 5 0.00024924 0.00062287 0.00075629 0.00022513 0.00061220 0.89656 0.89656
## 6 0.00025203 0.00064273 0.00075793 0.00026632 0.00073583 0.89458 0.89458
##    attr21  attr22  attr23  attr24      attr25    attr26   attr27
## 1 0.89669 0.89658 0.89658 0.89656  0.00768040  0.257360 -0.71184
## 2 0.89580 0.89677 0.89677 0.89673 -0.00940220 -0.059481 -0.29592
## 3 0.89581 0.89619 0.89619 0.89621  0.00595100 -0.075239 -0.22862
## 4 0.89479 0.89576 0.89576 0.89572  0.00205630  0.466570  0.56841
## 5 0.89655 0.89521 0.89520 0.89520 -0.00086017 -0.904870 -0.57395
## 6 0.89455 0.89573 0.89572 0.89572  0.00048305  0.164030 -0.13124
##       attr28    attr29   attr30     attr31     attr32     attr33
## 1 0.00487890 -0.095775 -0.44126 -0.0013168 -0.0013189 -0.0012477
## 2 0.00711140  0.119110  0.31117  0.0010932  0.0010911  0.0010682
## 3 0.00044468 -0.162300  0.56210  0.0028942  0.0029030  0.0028851
## 4 0.00693590 -0.467240  0.22673 -0.0012546 -0.0012421 -0.0012774
## 5 0.00561650  0.343380  0.84307 -0.0038112 -0.0038040 -0.0038000
## 6 0.00129750 -0.048862  2.21850 -0.0028981 -0.0028984 -0.0028680
##        attr34      attr35      attr36   attr37  attr38  attr39   attr40
## 1 -0.00437770 -0.00438410 -0.00438930 -0.66732  4.3662  6.0168 -0.63308
## 2 -0.00134400 -0.00134190 -0.00137550 -0.65404  1.3977  3.6048 -0.59314
## 3  0.00035014  0.00035803  0.00037366 -0.67146  2.8072  5.8007 -0.63252
## 4 -0.00497380 -0.00496550 -0.00497560 -0.67766  7.8629 23.3960 -0.62289
## 5 -0.00465540 -0.00464600 -0.00463950 -0.65867 14.8720  5.0582 -0.63010
## 6 -0.00151920 -0.00151880 -0.00140970 -0.65298  7.3162  3.9757 -0.61124
##   attr41  attr42  attr43  attr44  attr45  attr46  attr47  attr48 targetVar
## 1 2.9646  8.1198 -1.4961 -1.4961 -1.4961 -1.4996 -1.4996 -1.4996         1
## 2 7.6252  6.1690 -1.4967 -1.4967 -1.4967 -1.5005 -1.5005 -1.5005         1
## 3 2.7784  5.3017 -1.4983 -1.4983 -1.4982 -1.4985 -1.4985 -1.4985         1
## 4 6.5534  6.2606 -1.4963 -1.4963 -1.4963 -1.4975 -1.4975 -1.4976         1
## 5 4.5155  9.5231 -1.4958 -1.4958 -1.4958 -1.4959 -1.4959 -1.4959         1
## 6 5.8337 18.6970 -1.4956 -1.4956 -1.4956 -1.4973 -1.4972 -1.4973         1
sapply(Xy_original, class)
##     attr1     attr2     attr3     attr4     attr5     attr6     attr7 
## "numeric" "numeric" "numeric" "numeric" "numeric" "numeric" "numeric" 
##     attr8     attr9    attr10    attr11    attr12    attr13    attr14 
## "numeric" "numeric" "numeric" "numeric" "numeric" "numeric" "numeric" 
##    attr15    attr16    attr17    attr18    attr19    attr20    attr21 
## "numeric" "numeric" "numeric" "numeric" "numeric" "numeric" "numeric" 
##    attr22    attr23    attr24    attr25    attr26    attr27    attr28 
## "numeric" "numeric" "numeric" "numeric" "numeric" "numeric" "numeric" 
##    attr29    attr30    attr31    attr32    attr33    attr34    attr35 
## "numeric" "numeric" "numeric" "numeric" "numeric" "numeric" "numeric" 
##    attr36    attr37    attr38    attr39    attr40    attr41    attr42 
## "numeric" "numeric" "numeric" "numeric" "numeric" "numeric" "numeric" 
##    attr43    attr44    attr45    attr46    attr47    attr48 targetVar 
## "numeric" "numeric" "numeric" "numeric" "numeric" "numeric"  "factor"
sapply(Xy_original, function(x) sum(is.na(x)))
##     attr1     attr2     attr3     attr4     attr5     attr6     attr7 
##         0         0         0         0         0         0         0 
##     attr8     attr9    attr10    attr11    attr12    attr13    attr14 
##         0         0         0         0         0         0         0 
##    attr15    attr16    attr17    attr18    attr19    attr20    attr21 
##         0         0         0         0         0         0         0 
##    attr22    attr23    attr24    attr25    attr26    attr27    attr28 
##         0         0         0         0         0         0         0 
##    attr29    attr30    attr31    attr32    attr33    attr34    attr35 
##         0         0         0         0         0         0         0 
##    attr36    attr37    attr38    attr39    attr40    attr41    attr42 
##         0         0         0         0         0         0         0 
##    attr43    attr44    attr45    attr46    attr47    attr48 targetVar 
##         0         0         0         0         0         0         0

1.e) Splitting Data into Training and Test Sets

# Use variable totCol to hold the number of columns in the dataframe
totCol <- ncol(Xy_original)

# Set up variable totAttr for the total number of attribute columns
totAttr <- totCol-1

# targetCol variable indicates the column location of the target/class variable
# If the first column, set targetCol to 1. If the last column, set targetCol to totCol
# If targetCol is neither 1 nor totCol, be aware when slicing up the dataframes for visualization!
targetCol <- totCol

# Standardize the class column to the name of targetVar if applicable
# colnames(Xy_original)[targetCol] <- "targetVar"
# We create attribute-only and target-only datasets (X_original and y_original)
# for various visualization and cleaning/transformation operations

if (targetCol==1) {
  X_original <- Xy_original[,(targetCol+1):totCol]
  y_original <- Xy_original[,targetCol]
} else {
  X_original <- Xy_original[,1:(totAttr)]
  y_original <- Xy_original[,totCol]
}

dim(Xy_original)
## [1] 58509    49
dim(X_original)
## [1] 58509    48
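The actual train/test split is not shown in this chunk; with caret it would typically use createDataPartition, which samples within each class level so the split preserves the class proportions. A minimal sketch on a hypothetical balanced two-class frame (`toy` and its columns are stand-ins, not the project's variables):

```r
library(caret)

set.seed(888)
# Toy frame standing in for the full dataset (hypothetical data)
toy <- data.frame(x = rnorm(100),
                  targetVar = factor(rep(1:2, each = 50)))

# Stratified 70/30 split: indices are sampled within each class level
trainIdx <- createDataPartition(toy$targetVar, p = 0.7, list = FALSE)
train_set <- toy[trainIdx, ]
test_set  <- toy[-trainIdx, ]

table(train_set$targetVar)  # roughly 35 observations per class
```

Because the Sensorless Drive Diagnosis classes are perfectly balanced (5,319 rows each), a stratified split keeps that balance intact in both partitions.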

1.f) Set up the parameters for data visualization

# Set up the number of row and columns for visualization display. dispRow * dispCol should be >= totAttr
dispCol <- 3
if (totAttr %% dispCol == 0) {
  dispRow <- totAttr %/% dispCol
} else {
  dispRow <- (totAttr %/% dispCol) + 1
}
cat("Will attempt to create graphics grid (col x row): ", dispCol, ' by ', dispRow)
## Will attempt to create graphics grid (col x row):  3  by  16
if (notifyStatus) email_notify(paste("Library and Data Loading completed!",date()))
## [1] "Java-Object{org.apache.commons.mail.SimpleEmail@30c7da1e}"

2. Summarize Data

To gain a better understanding of the data that we have on-hand, we will leverage a number of descriptive statistics and data visualization techniques. The plan is to use the results to consider new questions, review assumptions, and validate hypotheses that we can investigate later with specialized models.

if (notifyStatus) email_notify(paste("Data Summarization and Visualization has begun!",date()))
## [1] "Java-Object{org.apache.commons.mail.SimpleEmail@2812cbfa}"

2.a) Descriptive statistics

2.a.i) Peek at the data itself

head(Xy_original)
##         attr1       attr2       attr3       attr4       attr5       attr6
## 1 -3.0146e-07  8.2603e-06 -1.1517e-05 -2.3098e-06 -1.4386e-06 -2.1225e-05
## 2  2.9132e-06 -5.2477e-06  3.3421e-06 -6.0561e-06  2.7789e-06 -3.7524e-06
## 3 -2.9517e-06 -3.1840e-06 -1.5920e-05 -1.2084e-06 -1.5753e-06  1.7394e-05
## 4 -1.3226e-06  8.8201e-06 -1.5879e-05 -4.8111e-06 -7.2829e-07  4.1439e-06
## 5 -6.8366e-08  5.6663e-07 -2.5906e-05 -6.4901e-06 -7.9406e-07  1.3491e-05
## 6 -9.5849e-07  5.2143e-08 -4.7359e-05  6.4537e-07 -2.3041e-06  5.4999e-05
##      attr7    attr8    attr9    attr10    attr11    attr12     attr13
## 1 0.031718 0.031710 0.031721 -0.032963 -0.032962 -0.032941 0.00076881
## 2 0.030804 0.030810 0.030806 -0.033520 -0.033522 -0.033519 0.00076614
## 3 0.032877 0.032880 0.032896 -0.029834 -0.029832 -0.029849 0.00076385
## 4 0.029410 0.029401 0.029417 -0.030156 -0.030155 -0.030159 0.00076950
## 5 0.030119 0.030119 0.030145 -0.031393 -0.031392 -0.031405 0.00076335
## 6 0.031154 0.031154 0.031201 -0.032789 -0.032787 -0.032842 0.00076713
##       attr14     attr15     attr16     attr17     attr18  attr19  attr20
## 1 0.00023244 0.00059982 0.00075698 0.00024722 0.00072498 0.89669 0.89669
## 2 0.00022071 0.00048534 0.00075479 0.00025208 0.00066780 0.89583 0.89583
## 3 0.00022992 0.00056024 0.00075789 0.00023620 0.00071163 0.89583 0.89583
## 4 0.00024423 0.00075301 0.00075545 0.00025668 0.00075448 0.89480 0.89481
## 5 0.00024924 0.00062287 0.00075629 0.00022513 0.00061220 0.89656 0.89656
## 6 0.00025203 0.00064273 0.00075793 0.00026632 0.00073583 0.89458 0.89458
##    attr21  attr22  attr23  attr24      attr25    attr26   attr27
## 1 0.89669 0.89658 0.89658 0.89656  0.00768040  0.257360 -0.71184
## 2 0.89580 0.89677 0.89677 0.89673 -0.00940220 -0.059481 -0.29592
## 3 0.89581 0.89619 0.89619 0.89621  0.00595100 -0.075239 -0.22862
## 4 0.89479 0.89576 0.89576 0.89572  0.00205630  0.466570  0.56841
## 5 0.89655 0.89521 0.89520 0.89520 -0.00086017 -0.904870 -0.57395
## 6 0.89455 0.89573 0.89572 0.89572  0.00048305  0.164030 -0.13124
##       attr28    attr29   attr30     attr31     attr32     attr33
## 1 0.00487890 -0.095775 -0.44126 -0.0013168 -0.0013189 -0.0012477
## 2 0.00711140  0.119110  0.31117  0.0010932  0.0010911  0.0010682
## 3 0.00044468 -0.162300  0.56210  0.0028942  0.0029030  0.0028851
## 4 0.00693590 -0.467240  0.22673 -0.0012546 -0.0012421 -0.0012774
## 5 0.00561650  0.343380  0.84307 -0.0038112 -0.0038040 -0.0038000
## 6 0.00129750 -0.048862  2.21850 -0.0028981 -0.0028984 -0.0028680
##        attr34      attr35      attr36   attr37  attr38  attr39   attr40
## 1 -0.00437770 -0.00438410 -0.00438930 -0.66732  4.3662  6.0168 -0.63308
## 2 -0.00134400 -0.00134190 -0.00137550 -0.65404  1.3977  3.6048 -0.59314
## 3  0.00035014  0.00035803  0.00037366 -0.67146  2.8072  5.8007 -0.63252
## 4 -0.00497380 -0.00496550 -0.00497560 -0.67766  7.8629 23.3960 -0.62289
## 5 -0.00465540 -0.00464600 -0.00463950 -0.65867 14.8720  5.0582 -0.63010
## 6 -0.00151920 -0.00151880 -0.00140970 -0.65298  7.3162  3.9757 -0.61124
##   attr41  attr42  attr43  attr44  attr45  attr46  attr47  attr48 targetVar
## 1 2.9646  8.1198 -1.4961 -1.4961 -1.4961 -1.4996 -1.4996 -1.4996         1
## 2 7.6252  6.1690 -1.4967 -1.4967 -1.4967 -1.5005 -1.5005 -1.5005         1
## 3 2.7784  5.3017 -1.4983 -1.4983 -1.4982 -1.4985 -1.4985 -1.4985         1
## 4 6.5534  6.2606 -1.4963 -1.4963 -1.4963 -1.4975 -1.4975 -1.4976         1
## 5 4.5155  9.5231 -1.4958 -1.4958 -1.4958 -1.4959 -1.4959 -1.4959         1
## 6 5.8337 18.6970 -1.4956 -1.4956 -1.4956 -1.4973 -1.4972 -1.4973         1

2.a.ii) Dimensions of the dataset

dim(Xy_original)
## [1] 58509    49

2.a.iii) Types of the attributes

sapply(Xy_original, class)
##     attr1     attr2     attr3     attr4     attr5     attr6     attr7 
## "numeric" "numeric" "numeric" "numeric" "numeric" "numeric" "numeric" 
##     attr8     attr9    attr10    attr11    attr12    attr13    attr14 
## "numeric" "numeric" "numeric" "numeric" "numeric" "numeric" "numeric" 
##    attr15    attr16    attr17    attr18    attr19    attr20    attr21 
## "numeric" "numeric" "numeric" "numeric" "numeric" "numeric" "numeric" 
##    attr22    attr23    attr24    attr25    attr26    attr27    attr28 
## "numeric" "numeric" "numeric" "numeric" "numeric" "numeric" "numeric" 
##    attr29    attr30    attr31    attr32    attr33    attr34    attr35 
## "numeric" "numeric" "numeric" "numeric" "numeric" "numeric" "numeric" 
##    attr36    attr37    attr38    attr39    attr40    attr41    attr42 
## "numeric" "numeric" "numeric" "numeric" "numeric" "numeric" "numeric" 
##    attr43    attr44    attr45    attr46    attr47    attr48 targetVar 
## "numeric" "numeric" "numeric" "numeric" "numeric" "numeric"  "factor"

2.a.iv) Statistical summary of the attributes

summary(Xy_original)
##      attr1                attr2                attr3           
##  Min.   :-1.372e-02   Min.   :-5.414e-03   Min.   :-1.358e-02  
##  1st Qu.:-7.431e-06   1st Qu.:-1.444e-05   1st Qu.:-7.240e-05  
##  Median :-2.653e-06   Median : 8.800e-07   Median : 5.140e-07  
##  Mean   :-3.333e-06   Mean   : 1.440e-06   Mean   : 1.412e-06  
##  3rd Qu.: 1.571e-06   3rd Qu.: 1.878e-05   3rd Qu.: 7.520e-05  
##  Max.   : 5.784e-03   Max.   : 4.525e-03   Max.   : 5.238e-03  
##                                                                
##      attr4                attr5                attr6           
##  Min.   :-1.279e-02   Min.   :-8.356e-03   Min.   :-9.741e-03  
##  1st Qu.:-5.418e-06   1st Qu.:-1.475e-05   1st Qu.:-7.379e-05  
##  Median :-1.059e-06   Median : 7.540e-07   Median :-1.660e-07  
##  Mean   :-1.313e-06   Mean   : 1.351e-06   Mean   :-2.650e-07  
##  3rd Qu.: 3.555e-06   3rd Qu.: 1.906e-05   3rd Qu.: 7.139e-05  
##  Max.   : 1.453e-03   Max.   : 8.245e-04   Max.   : 2.754e-03  
##                                                                
##      attr7               attr8               attr9          
##  Min.   :-0.139890   Min.   :-0.135940   Min.   :-0.130860  
##  1st Qu.:-0.019927   1st Qu.:-0.019951   1st Qu.:-0.019925  
##  Median : 0.013226   Median : 0.013230   Median : 0.013247  
##  Mean   : 0.001915   Mean   : 0.001913   Mean   : 0.001912  
##  3rd Qu.: 0.024770   3rd Qu.: 0.024776   3rd Qu.: 0.024777  
##  Max.   : 0.069125   Max.   : 0.069130   Max.   : 0.069131  
##                                                             
##      attr10             attr11             attr12        
##  Min.   :-0.21864   Min.   :-0.21860   Min.   :-0.21863  
##  1st Qu.:-0.03214   1st Qu.:-0.03216   1st Qu.:-0.03217  
##  Median :-0.01557   Median :-0.01559   Median :-0.01560  
##  Mean   :-0.01190   Mean   :-0.01190   Mean   :-0.01190  
##  3rd Qu.: 0.02061   3rd Qu.: 0.02062   3rd Qu.: 0.02060  
##  Max.   : 0.35258   Max.   : 0.35256   Max.   : 0.35263  
##                                                          
##      attr13              attr14              attr15         
##  Min.   :0.0007509   Min.   :0.0001884   Min.   :0.0003542  
##  1st Qu.:0.0011368   1st Qu.:0.0005992   1st Qu.:0.0012566  
##  Median :0.0021989   Median :0.0011845   Median :0.0029800  
##  Mean   :0.0018763   Mean   :0.0010834   Mean   :0.0030917  
##  3rd Qu.:0.0025265   3rd Qu.:0.0014563   3rd Qu.:0.0043361  
##  Max.   :0.1365700   Max.   :0.0515430   Max.   :0.1039300  
##                                                             
##      attr16              attr17              attr18        
##  Min.   :0.0007445   Min.   :0.0001889   Min.   :0.000357  
##  1st Qu.:0.0011394   1st Qu.:0.0005981   1st Qu.:0.001288  
##  Median :0.0021878   Median :0.0011820   Median :0.002891  
##  Mean   :0.0018665   Mean   :0.0010775   Mean   :0.003076  
##  3rd Qu.:0.0025230   3rd Qu.:0.0014538   3rd Qu.:0.004322  
##  Max.   :0.1087700   Max.   :0.0647640   Max.   :0.078530  
##                                                            
##      attr19           attr20           attr21           attr22      
##  Min.   :0.7976   Min.   :0.7976   Min.   :0.7976   Min.   :0.7984  
##  1st Qu.:1.3274   1st Qu.:1.3274   1st Qu.:1.3267   1st Qu.:1.3287  
##  Median :1.5732   Median :1.5731   Median :1.5729   Median :1.5726  
##  Mean   :1.6183   Mean   :1.6183   Mean   :1.6178   Mean   :1.6178  
##  3rd Qu.:1.8858   3rd Qu.:1.8857   3rd Qu.:1.8849   3rd Qu.:1.8834  
##  Max.   :2.3770   Max.   :2.3769   Max.   :2.3758   Max.   :2.3728  
##                                                                     
##      attr23           attr24           attr25          
##  Min.   :0.7984   Min.   :0.7984   Min.   :-15.796000  
##  1st Qu.:1.3287   1st Qu.:1.3281   1st Qu.: -0.006033  
##  Median :1.5725   Median :1.5724   Median :  0.003020  
##  Mean   :1.6177   Mean   :1.6173   Mean   :  0.001909  
##  3rd Qu.:1.8833   3rd Qu.:1.8825   3rd Qu.:  0.011576  
##  Max.   :2.3726   Max.   :2.3715   Max.   : 28.285000  
##                                                        
##      attr26               attr27              attr28          
##  Min.   :-12.351000   Min.   :-7.959000   Min.   :-11.903000  
##  1st Qu.: -0.205900   1st Qu.:-0.453440   1st Qu.: -0.009230  
##  Median :  0.006513   Median :-0.000126   Median :  0.000168  
##  Mean   :  0.008799   Mean   :-0.003465   Mean   : -0.000157  
##  3rd Qu.:  0.220960   3rd Qu.: 0.445910   3rd Qu.:  0.008671  
##  Max.   : 12.437000   Max.   : 9.580300   Max.   : 18.294000  
##                                                               
##      attr29               attr30              attr31          
##  Min.   :-12.508000   Min.   :-9.976600   Min.   :-5.024e-02  
##  1st Qu.: -0.203390   1st Qu.:-0.448040   1st Qu.:-5.102e-03  
##  Median :  0.008109   Median :-0.004195   Median : 4.520e-04  
##  Mean   :  0.012089   Mean   :-0.009958   Mean   : 1.628e-05  
##  3rd Qu.:  0.225560   3rd Qu.: 0.429030   3rd Qu.: 5.165e-03  
##  Max.   : 10.977000   Max.   : 8.764000   Max.   : 8.638e-02  
##                                                               
##      attr32               attr33               attr34          
##  Min.   :-0.0518910   Min.   :-5.279e-02   Min.   :-0.3377100  
##  1st Qu.:-0.0051129   1st Qu.:-5.109e-03   1st Qu.:-0.0045218  
##  Median : 0.0004506   Median : 4.612e-04   Median :-0.0002775  
##  Mean   : 0.0000142   Mean   : 1.922e-05   Mean   :-0.0000347  
##  3rd Qu.: 0.0051651   3rd Qu.: 5.174e-03   3rd Qu.: 0.0049597  
##  Max.   : 0.0864570   Max.   : 8.655e-02   Max.   : 0.1948200  
##                                                                
##      attr35               attr36               attr37        
##  Min.   :-0.3377000   Min.   :-0.3377500   Min.   :  -0.912  
##  1st Qu.:-0.0045180   1st Qu.:-0.0044891   1st Qu.:  -0.715  
##  Median :-0.0002735   Median :-0.0002740   Median :  -0.664  
##  Mean   :-0.0000376   Mean   :-0.0000316   Mean   :  -0.463  
##  3rd Qu.: 0.0049553   3rd Qu.: 0.0049666   3rd Qu.:  -0.582  
##  Max.   : 0.1902000   Max.   : 0.1850300   Max.   :4015.400  
##                                                              
##      attr38            attr39             attr40        
##  Min.   : -0.618   Min.   :  0.5222   Min.   :  -0.902  
##  1st Qu.:  1.485   1st Qu.:  4.4513   1st Qu.:  -0.715  
##  Median :  3.300   Median :  6.5668   Median :  -0.662  
##  Mean   :  7.447   Mean   :  8.4068   Mean   :  -0.398  
##  3rd Qu.:  8.373   3rd Qu.:  9.9526   3rd Qu.:  -0.574  
##  Max.   :312.520   Max.   :265.3300   Max.   :3670.800  
##                                                         
##      attr41             attr42             attr43           attr44      
##  Min.   : -0.5968   Min.   :  0.3207   Min.   :-1.526   Min.   :-1.526  
##  1st Qu.:  1.4503   1st Qu.:  4.4363   1st Qu.:-1.503   1st Qu.:-1.503  
##  Median :  3.3013   Median :  6.4791   Median :-1.500   Median :-1.500  
##  Mean   :  7.2938   Mean   :  8.2738   Mean   :-1.501   Mean   :-1.501  
##  3rd Qu.:  8.2885   3rd Qu.:  9.8575   3rd Qu.:-1.498   3rd Qu.:-1.498  
##  Max.   :889.9300   Max.   :153.1500   Max.   :-1.458   Max.   :-1.456  
##                                                                         
##      attr45           attr46           attr47           attr48      
##  Min.   :-1.524   Min.   :-1.521   Min.   :-1.523   Min.   :-1.521  
##  1st Qu.:-1.503   1st Qu.:-1.500   1st Qu.:-1.500   1st Qu.:-1.500  
##  Median :-1.500   Median :-1.498   Median :-1.498   Median :-1.498  
##  Mean   :-1.501   Mean   :-1.498   Mean   :-1.498   Mean   :-1.498  
##  3rd Qu.:-1.498   3rd Qu.:-1.496   3rd Qu.:-1.496   3rd Qu.:-1.496  
##  Max.   :-1.456   Max.   :-1.337   Max.   :-1.337   Max.   :-1.337  
##                                                                     
##    targetVar    
##  1      : 5319  
##  2      : 5319  
##  3      : 5319  
##  4      : 5319  
##  5      : 5319  
##  6      : 5319  
##  (Other):26595

2.a.v) Summarize the levels of the class attribute

cbind(freq=table(y_original), percentage=prop.table(table(y_original))*100)
##    freq percentage
## 1  5319   9.090909
## 2  5319   9.090909
## 3  5319   9.090909
## 4  5319   9.090909
## 5  5319   9.090909
## 6  5319   9.090909
## 7  5319   9.090909
## 8  5319   9.090909
## 9  5319   9.090909
## 10 5319   9.090909
## 11 5319   9.090909

2.b) Data visualizations

# Boxplots for each attribute
# par(mfrow=c(dispRow,dispCol))
for(i in 1:totAttr) {
    boxplot(X_original[,i], main=names(X_original)[i])
}

# Histograms for each attribute
# par(mfrow=c(dispRow,dispCol))
for(i in 1:totAttr) {
    hist(X_original[,i], main=names(X_original)[i])
}

# Density plot for each attribute
# par(mfrow=c(dispRow,dispCol))
for(i in 1:totAttr) {
    plot(density(X_original[,i]), main=names(X_original)[i])
}

# Correlation matrix
correlations <- cor(X_original)
corrplot(correlations, method="circle")
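When the correlation plot reveals clusters of highly correlated attributes, caret's findCorrelation() can flag candidates for removal. A minimal sketch, assuming a 0.90 cutoff (this project does not actually drop any attributes in this iteration; see Section 3.c):

# Flag attributes whose pairwise correlation exceeds an assumed 0.90 cutoff
highCorr <- findCorrelation(correlations, cutoff=0.90)
names(X_original)[highCorr]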

if (notifyStatus) email_notify(paste("Data Summarization and Visualization completed!",date()))
## [1] "Java-Object{org.apache.commons.mail.SimpleEmail@2f7a2457}"

3. Prepare Data

Some datasets may require additional preparation activities that best expose the structure of the problem and the relationships between the input attributes and the output variable. For this project, the data-prep tasks include feature scaling, splitting the data into training and test sets, and feature selection.

if (notifyStatus) email_notify(paste("Data Cleaning and Transformation has begun!",date()))
## [1] "Java-Object{org.apache.commons.mail.SimpleEmail@5ebec15}"

3.a) Feature Scaling and Data Pre-Processing

# Apply feature scaling techniques

preProcValues <- preProcess(X_original, method = c("center", "scale", "YeoJohnson"))
X_transformed <- predict(preProcValues, X_original)
Xy_original <- cbind(X_transformed, y_original)
colnames(Xy_original)[totCol] <- "targetVar"
# Histograms for each attribute
for(i in 1:totAttr) {
    hist(X_transformed[,i], main=names(X_transformed)[i])
}
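The preProcess() call centers each attribute to zero mean, scales it to unit variance, and applies a Yeo-Johnson power transform to reduce skewness (caret applies the power transform before centering and scaling, regardless of the order in the method vector). A minimal sketch on a toy right-skewed vector, not on this project's attributes:

# Toy demonstration of center/scale/YeoJohnson on a skewed attribute
toy <- data.frame(x=rexp(1000, rate=2))
ppToy <- preProcess(toy, method=c("center", "scale", "YeoJohnson"))
toyTransformed <- predict(ppToy, toy)
c(mean=mean(toyTransformed$x), sd=sd(toyTransformed$x))  # approximately 0 and 1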

3.b) Splitting Data into Training and Test Sets

# Create the sub-datasets for the model training and validation activities
set.seed(seedNum)

# Use 75% of the data to train the models and the remaining for testing/validation
training_index <- createDataPartition(Xy_original$targetVar, p=0.75, list=FALSE)
Xy_train <- Xy_original[training_index,]
Xy_test <- Xy_original[-training_index,]

if (targetCol==1) {
  y_test <- Xy_test[,targetCol]
} else {
  y_test <- Xy_test[,totCol]
}
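Because createDataPartition() samples within each level of the target variable, the 75/25 split preserves the balanced class distribution. A quick sketch to verify the stratification, using the variable names from the chunk above:

# Class proportions should stay at roughly 9.09% in both partitions
round(prop.table(table(Xy_train$targetVar))*100, 2)
round(prop.table(table(Xy_test$targetVar))*100, 2)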

3.c) Feature Selection

# Not applicable for this iteration of the project

3.d) Display the Final Datasets for Model-Building

# Finalize the training and testing datasets for the modeling activities
dim(Xy_train)
## [1] 43890    49
dim(Xy_test)
## [1] 14619    49
if (notifyStatus) email_notify(paste("Data Cleaning and Transformation completed!",date()))
## [1] "Java-Object{org.apache.commons.mail.SimpleEmail@38082d64}"
proc.time()-startTimeScript
##    user  system elapsed 
##  59.849   1.040  70.108

4. Model and Evaluate Algorithms

After the data prep, we next work on finding a workable model by evaluating a subset of machine learning algorithms that are good at exploiting the structure of the training data. The evaluation tasks include defining the test harness, spot-checking a diverse set of algorithms, and comparing their estimated accuracy.

For this project, we will evaluate one linear, one non-linear, and three ensemble algorithms:

Linear Algorithm: Linear Discriminant Analysis

Non-Linear Algorithm: Decision Trees (CART)

Ensemble Algorithms: Bagged CART, Random Forest, and Gradient Boosting

The random number seed is reset before each run to ensure that every algorithm is evaluated using the same data splits, which makes the results directly comparable.
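The train() calls below reference a control object and a metricTarget string defined earlier in the script. Judging from the resampling summaries ("Cross-Validated (10 fold, repeated 1 times)" with Accuracy as the selection metric), the setup presumably resembles this sketch; the seed value shown is a placeholder, not the project's actual seed:

# Assumed test harness: 10-fold cross-validation with accuracy as the metric
control <- trainControl(method="repeatedcv", number=10, repeats=1)
metricTarget <- "Accuracy"
seedNum <- 888  # placeholder; the actual seed is set earlier in the script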

4.a) Generate models using linear algorithms

startModeling <- proc.time()
# Linear Discriminant Analysis (Classification)
if (notifyStatus) email_notify(paste("Linear Discriminant Analysis modeling has begun!",date()))
## [1] "Java-Object{org.apache.commons.mail.SimpleEmail@180bc464}"
startTimeModule <- proc.time()
set.seed(seedNum)
fit.lda <- train(targetVar~., data=Xy_train, method="lda", metric=metricTarget, trControl=control)
## Warning in lda.default(x, grouping, ...): variables are collinear
print(fit.lda)
## Linear Discriminant Analysis 
## 
## 43890 samples
##    48 predictor
##    11 classes: '1', '2', '3', '4', '5', '6', '7', '8', '9', '10', '11' 
## 
## No pre-processing
## Resampling: Cross-Validated (10 fold, repeated 1 times) 
## Summary of sample sizes: 39501, 39501, 39501, 39501, 39501, 39501, ... 
## Resampling results:
## 
##   Accuracy  Kappa    
##   0.845614  0.8301754
proc.time()-startTimeModule
##    user  system elapsed 
##   9.614   1.664   8.690
if (notifyStatus) email_notify(paste("Linear Discriminant Analysis modeling completed!",date()))
## [1] "Java-Object{org.apache.commons.mail.SimpleEmail@2d554825}"

4.b) Generate models using nonlinear algorithms

# Decision Tree - CART (Regression/Classification)
if (notifyStatus) email_notify(paste("Decision Tree modeling has begun!",date()))
## [1] "Java-Object{org.apache.commons.mail.SimpleEmail@4909b8da}"
startTimeModule <- proc.time()
set.seed(seedNum)
fit.cart <- train(targetVar~., data=Xy_train, method="rpart", metric=metricTarget, trControl=control)
print(fit.cart)
## CART 
## 
## 43890 samples
##    48 predictor
##    11 classes: '1', '2', '3', '4', '5', '6', '7', '8', '9', '10', '11' 
## 
## No pre-processing
## Resampling: Cross-Validated (10 fold, repeated 1 times) 
## Summary of sample sizes: 39501, 39501, 39501, 39501, 39501, 39501, ... 
## Resampling results across tuning parameters:
## 
##   cp          Accuracy    Kappa    
##   0.09927318  0.43572568  0.3792982
##   0.09994987  0.23627250  0.1598997
##   0.10000000  0.09090909  0.0000000
## 
## Accuracy was used to select the optimal model using the largest value.
## The final value used for the model was cp = 0.09927318.
proc.time()-startTimeModule
##    user  system elapsed 
##  55.467   1.018  55.413
if (notifyStatus) email_notify(paste("Decision Tree modeling completed!",date()))
## [1] "Java-Object{org.apache.commons.mail.SimpleEmail@54a097cc}"

4.c) Generate models using ensemble algorithms

In this section, we will explore the use and tuning of ensemble algorithms to see whether we can improve the results.

# Bagged CART (Regression/Classification)
if (notifyStatus) email_notify(paste("Bagged CART modeling has begun!",date()))
## [1] "Java-Object{org.apache.commons.mail.SimpleEmail@50f8360d}"
startTimeModule <- proc.time()
set.seed(seedNum)
fit.bagcart <- train(targetVar~., data=Xy_train, method="treebag", metric=metricTarget, trControl=control)
print(fit.bagcart)
## Bagged CART 
## 
## 43890 samples
##    48 predictor
##    11 classes: '1', '2', '3', '4', '5', '6', '7', '8', '9', '10', '11' 
## 
## No pre-processing
## Resampling: Cross-Validated (10 fold, repeated 1 times) 
## Summary of sample sizes: 39501, 39501, 39501, 39501, 39501, 39501, ... 
## Resampling results:
## 
##   Accuracy   Kappa    
##   0.9879471  0.9867419
proc.time()-startTimeModule
##     user   system  elapsed 
## 1107.455   25.287 1105.384
if (notifyStatus) email_notify(paste("Bagged CART modeling completed!",date()))
## [1] "Java-Object{org.apache.commons.mail.SimpleEmail@337d0578}"
# Random Forest (Regression/Classification)
if (notifyStatus) email_notify(paste("Random Forest modeling has begun!",date()))
## [1] "Java-Object{org.apache.commons.mail.SimpleEmail@2669b199}"
startTimeModule <- proc.time()
set.seed(seedNum)
fit.rf <- train(targetVar~., data=Xy_train, method="rf", metric=metricTarget, trControl=control)
print(fit.rf)
## Random Forest 
## 
## 43890 samples
##    48 predictor
##    11 classes: '1', '2', '3', '4', '5', '6', '7', '8', '9', '10', '11' 
## 
## No pre-processing
## Resampling: Cross-Validated (10 fold, repeated 1 times) 
## Summary of sample sizes: 39501, 39501, 39501, 39501, 39501, 39501, ... 
## Resampling results across tuning parameters:
## 
##   mtry  Accuracy   Kappa    
##    2    0.9989747  0.9988722
##   25    0.9933470  0.9926817
##   48    0.9899294  0.9889223
## 
## Accuracy was used to select the optimal model using the largest value.
## The final value used for the model was mtry = 2.
proc.time()-startTimeModule
##     user   system  elapsed 
## 5167.401    8.427 5185.903
if (notifyStatus) email_notify(paste("Random Forest modeling completed!",date()))
## [1] "Java-Object{org.apache.commons.mail.SimpleEmail@3c756e4d}"
# Gradient Boosting (Regression/Classification)
if (notifyStatus) email_notify(paste("Gradient Boosting modeling has begun!",date()))
## [1] "Java-Object{org.apache.commons.mail.SimpleEmail@4439f31e}"
startTimeModule <- proc.time()
set.seed(seedNum)
# fit.gbm <- train(targetVar~., data=Xy_train, method="gbm", metric=metricTarget, trControl=control, verbose=F)
fit.gbm <- train(targetVar~., data=Xy_train, method="xgbTree", metric=metricTarget, trControl=control, verbose=F)
print(fit.gbm)
## eXtreme Gradient Boosting 
## 
## 43890 samples
##    48 predictor
##    11 classes: '1', '2', '3', '4', '5', '6', '7', '8', '9', '10', '11' 
## 
## No pre-processing
## Resampling: Cross-Validated (10 fold, repeated 1 times) 
## Summary of sample sizes: 39501, 39501, 39501, 39501, 39501, 39501, ... 
## Resampling results across tuning parameters:
## 
##   eta  max_depth  colsample_bytree  subsample  nrounds  Accuracy 
##   0.3  1          0.6               0.50        50      0.9185008
##   0.3  1          0.6               0.50       100      0.9616314
##   0.3  1          0.6               0.50       150      0.9754842
##   0.3  1          0.6               0.75        50      0.9204375
##   0.3  1          0.6               0.75       100      0.9610162
##   0.3  1          0.6               0.75       150      0.9751652
##   0.3  1          0.6               1.00        50      0.9202552
##   0.3  1          0.6               1.00       100      0.9621554
##   0.3  1          0.6               1.00       150      0.9752335
##   0.3  1          0.8               0.50        50      0.9180907
##   0.3  1          0.8               0.50       100      0.9606972
##   0.3  1          0.8               0.50       150      0.9747779
##   0.3  1          0.8               0.75        50      0.9203691
##   0.3  1          0.8               0.75       100      0.9616997
##   0.3  1          0.8               0.75       150      0.9753019
##   0.3  1          0.8               1.00        50      0.9203008
##   0.3  1          0.8               1.00       100      0.9616769
##   0.3  1          0.8               1.00       150      0.9754386
##   0.3  2          0.6               0.50        50      0.9863978
##   0.3  2          0.6               0.50       100      0.9958761
##   0.3  2          0.6               0.50       150      0.9975849
##   0.3  2          0.6               0.75        50      0.9856459
##   0.3  2          0.6               0.75       100      0.9957393
##   0.3  2          0.6               0.75       150      0.9976304
##   0.3  2          0.6               1.00        50      0.9856459
##   0.3  2          0.6               1.00       100      0.9963317
##   0.3  2          0.6               1.00       150      0.9980406
##   0.3  2          0.8               0.50        50      0.9855092
##   0.3  2          0.8               0.50       100      0.9956026
##   0.3  2          0.8               0.50       150      0.9974026
##   0.3  2          0.8               0.75        50      0.9856687
##   0.3  2          0.8               0.75       100      0.9961495
##   0.3  2          0.8               0.75       150      0.9978127
##   0.3  2          0.8               1.00        50      0.9856687
##   0.3  2          0.8               1.00       100      0.9962634
##   0.3  2          0.8               1.00       150      0.9978355
##   0.3  3          0.6               0.50        50      0.9952837
##   0.3  3          0.6               0.50       100      0.9983595
##   0.3  3          0.6               0.50       150      0.9987241
##   0.3  3          0.6               0.75        50      0.9959900
##   0.3  3          0.6               0.75       100      0.9984962
##   0.3  3          0.6               0.75       150      0.9986785
##   0.3  3          0.6               1.00        50      0.9962178
##   0.3  3          0.6               1.00       100      0.9986102
##   0.3  3          0.6               1.00       150      0.9988152
##   0.3  3          0.8               0.50        50      0.9955571
##   0.3  3          0.8               0.50       100      0.9982456
##   0.3  3          0.8               0.50       150      0.9984962
##   0.3  3          0.8               0.75        50      0.9958533
##   0.3  3          0.8               0.75       100      0.9985418
##   0.3  3          0.8               0.75       150      0.9987469
##   0.3  3          0.8               1.00        50      0.9960811
##   0.3  3          0.8               1.00       100      0.9984735
##   0.3  3          0.8               1.00       150      0.9988152
##   0.4  1          0.6               0.50        50      0.9419002
##   0.4  1          0.6               0.50       100      0.9714741
##   0.4  1          0.6               0.50       150      0.9825701
##   0.4  1          0.6               0.75        50      0.9442925
##   0.4  1          0.6               0.75       100      0.9718159
##   0.4  1          0.6               0.75       150      0.9825245
##   0.4  1          0.6               1.00        50      0.9442242
##   0.4  1          0.6               1.00       100      0.9719070
##   0.4  1          0.6               1.00       150      0.9828207
##   0.4  1          0.8               0.50        50      0.9424470
##   0.4  1          0.8               0.50       100      0.9716564
##   0.4  1          0.8               0.50       150      0.9828663
##   0.4  1          0.8               0.75        50      0.9438596
##   0.4  1          0.8               0.75       100      0.9715197
##   0.4  1          0.8               0.75       150      0.9827979
##   0.4  1          0.8               1.00        50      0.9443153
##   0.4  1          0.8               1.00       100      0.9720893
##   0.4  1          0.8               1.00       150      0.9828890
##   0.4  2          0.6               0.50        50      0.9919116
##   0.4  2          0.6               0.50       100      0.9972887
##   0.4  2          0.6               0.50       150      0.9979722
##   0.4  2          0.6               0.75        50      0.9917293
##   0.4  2          0.6               0.75       100      0.9975621
##   0.4  2          0.6               0.75       150      0.9983823
##   0.4  2          0.6               1.00        50      0.9923217
##   0.4  2          0.6               1.00       100      0.9976988
##   0.4  2          0.6               1.00       150      0.9984507
##   0.4  2          0.8               0.50        50      0.9917749
##   0.4  2          0.8               0.50       100      0.9971748
##   0.4  2          0.8               0.50       150      0.9981545
##   0.4  2          0.8               0.75        50      0.9916838
##   0.4  2          0.8               0.75       100      0.9975393
##   0.4  2          0.8               0.75       150      0.9983368
##   0.4  2          0.8               1.00        50      0.9921394
##   0.4  2          0.8               1.00       100      0.9977216
##   0.4  2          0.8               1.00       150      0.9985190
##   0.4  3          0.6               0.50        50      0.9973115
##   0.4  3          0.6               0.50       100      0.9984279
##   0.4  3          0.6               0.50       150      0.9985646
##   0.4  3          0.6               0.75        50      0.9974254
##   0.4  3          0.6               0.75       100      0.9987013
##   0.4  3          0.6               0.75       150      0.9988380
##   0.4  3          0.6               1.00        50      0.9977444
##   0.4  3          0.6               1.00       100      0.9987241
##   0.4  3          0.6               1.00       150      0.9988608
##   0.4  3          0.8               0.50        50      0.9973798
##   0.4  3          0.8               0.50       100      0.9985646
##   0.4  3          0.8               0.50       150      0.9986329
##   0.4  3          0.8               0.75        50      0.9974482
##   0.4  3          0.8               0.75       100      0.9987013
##   0.4  3          0.8               0.75       150      0.9988380
##   0.4  3          0.8               1.00        50      0.9978583
##   0.4  3          0.8               1.00       100      0.9987241
##   0.4  3          0.8               1.00       150      0.9987697
##   Kappa    
##   0.9103509
##   0.9577945
##   0.9730326
##   0.9124812
##   0.9571178
##   0.9726817
##   0.9122807
##   0.9583709
##   0.9727569
##   0.9098997
##   0.9567669
##   0.9722556
##   0.9124060
##   0.9578697
##   0.9728321
##   0.9123308
##   0.9578446
##   0.9729825
##   0.9850376
##   0.9954637
##   0.9973434
##   0.9842105
##   0.9953133
##   0.9973935
##   0.9842105
##   0.9959649
##   0.9978446
##   0.9840602
##   0.9951629
##   0.9971429
##   0.9842356
##   0.9957644
##   0.9975940
##   0.9842356
##   0.9958897
##   0.9976190
##   0.9948120
##   0.9981955
##   0.9985965
##   0.9955890
##   0.9983459
##   0.9985464
##   0.9958396
##   0.9984712
##   0.9986967
##   0.9951128
##   0.9980702
##   0.9983459
##   0.9954386
##   0.9983960
##   0.9986216
##   0.9956892
##   0.9983208
##   0.9986967
##   0.9360902
##   0.9686216
##   0.9808271
##   0.9387218
##   0.9689975
##   0.9807769
##   0.9386466
##   0.9690977
##   0.9811028
##   0.9366917
##   0.9688221
##   0.9811529
##   0.9382456
##   0.9686717
##   0.9810777
##   0.9387469
##   0.9692982
##   0.9811779
##   0.9911028
##   0.9970175
##   0.9977694
##   0.9909023
##   0.9973183
##   0.9982206
##   0.9915539
##   0.9974687
##   0.9982957
##   0.9909524
##   0.9968922
##   0.9979699
##   0.9908521
##   0.9972932
##   0.9981704
##   0.9913534
##   0.9974937
##   0.9983709
##   0.9970426
##   0.9982707
##   0.9984211
##   0.9971679
##   0.9985714
##   0.9987218
##   0.9975188
##   0.9985965
##   0.9987469
##   0.9971178
##   0.9984211
##   0.9984962
##   0.9971930
##   0.9985714
##   0.9987218
##   0.9976441
##   0.9985965
##   0.9986466
## 
## Tuning parameter 'gamma' was held constant at a value of 0
## 
## Tuning parameter 'min_child_weight' was held constant at a value of 1
## Accuracy was used to select the optimal model using the largest value.
## The final values used for the model were nrounds = 150, max_depth = 3,
##  eta = 0.4, gamma = 0, colsample_bytree = 0.6, min_child_weight = 1
##  and subsample = 1.
proc.time()-startTimeModule
##      user    system   elapsed 
## 34947.968   131.044 17676.156
if (notifyStatus) email_notify(paste("Gradient Boosting modeling completed!",date()))
## [1] "Java-Object{org.apache.commons.mail.SimpleEmail@2a2d45ba}"

4.d) Compare baseline algorithms

results <- resamples(list(LDA=fit.lda, CART=fit.cart, BagCART=fit.bagcart, RF=fit.rf, GBM=fit.gbm))
summary(results)
## 
## Call:
## summary.resamples(object = results)
## 
## Models: LDA, CART, BagCART, RF, GBM 
## Number of resamples: 10 
## 
## Accuracy 
##              Min.   1st Qu.    Median      Mean   3rd Qu.      Max. NA's
## LDA     0.8395990 0.8418774 0.8446115 0.8456140 0.8494532 0.8528139    0
## CART    0.3634085 0.4534062 0.4535202 0.4357257 0.4540328 0.4543176    0
## BagCART 0.9854181 0.9863295 0.9876965 0.9879471 0.9895762 0.9906585    0
## RF      0.9974937 0.9988608 0.9990886 0.9989747 0.9993165 0.9995443    0
## GBM     0.9977216 0.9988608 0.9988608 0.9988608 0.9990886 0.9995443    0
## 
## Kappa 
##              Min.   1st Qu.    Median      Mean   3rd Qu.      Max. NA's
## LDA     0.8235589 0.8260652 0.8290727 0.8301754 0.8343985 0.8380952    0
## CART    0.2997494 0.3987469 0.3988722 0.3792982 0.3994361 0.3997494    0
## BagCART 0.9839599 0.9849624 0.9864662 0.9867419 0.9885338 0.9897243    0
## RF      0.9972431 0.9987469 0.9989975 0.9988722 0.9992481 0.9994987    0
## GBM     0.9974937 0.9987469 0.9987469 0.9987469 0.9989975 0.9994987    0
dotplot(results)

cat('The average accuracy from all models is:',
    mean(c(results$values$`LDA~Accuracy`,results$values$`CART~Accuracy`,results$values$`BagCART~Accuracy`,results$values$`RF~Accuracy`,results$values$`GBM~Accuracy`)),'\n')
## The average accuracy from all models is: 0.8534245
cat('Total training time for all models:',proc.time()-startModeling)
## Total training time for all models: 41292.75 167.555 24051.11 0 0
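The same average can be computed more compactly by selecting every accuracy column from the resamples object; a sketch relying on caret's "Model~Metric" column-naming convention:

# Average the accuracy columns across all five models and all resamples
accCols <- grep("~Accuracy$", names(results$values))
mean(unlist(results$values[, accCols]))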

5. Improve Accuracy or Results

After we arrive at a short list of machine learning algorithms with a good level of accuracy, we can look for ways to improve their accuracy further.

Using the two best-performing algorithms from the previous section, we will search for a combination of parameters for each algorithm that yields the best results.

5.a) Algorithm Tuning

Finally, we will further tune the top algorithms from the previous section and see whether we can get more accuracy out of them.

# Tuning algorithm #1 - Random Forest
if (notifyStatus) email_notify(paste("Algorithm #1 tuning has begun!",date()))
## [1] "Java-Object{org.apache.commons.mail.SimpleEmail@457e2f02}"
startTimeModule <- proc.time()
set.seed(seedNum)
grid <- expand.grid(mtry = c(2, 17, 33, 48))
fit.final1 <- train(targetVar~., data=Xy_train, method="rf", metric=metricTarget, tuneGrid=grid, trControl=control)
plot(fit.final1)

print(fit.final1)
## Random Forest 
## 
## 43890 samples
##    48 predictor
##    11 classes: '1', '2', '3', '4', '5', '6', '7', '8', '9', '10', '11' 
## 
## No pre-processing
## Resampling: Cross-Validated (10 fold, repeated 1 times) 
## Summary of sample sizes: 39501, 39501, 39501, 39501, 39501, 39501, ... 
## Resampling results across tuning parameters:
## 
##   mtry  Accuracy   Kappa    
##    2    0.9990203  0.9989223
##   17    0.9954887  0.9950376
##   33    0.9918660  0.9910526
##   48    0.9899066  0.9888972
## 
## Accuracy was used to select the optimal model using the largest value.
## The final value used for the model was mtry = 2.
proc.time()-startTimeModule
##     user   system  elapsed 
## 6759.321    8.359 6780.883
if (notifyStatus) email_notify(paste("Algorithm #1 tuning completed!",date()))
## [1] "Java-Object{org.apache.commons.mail.SimpleEmail@cb5822}"
# Tuning algorithm #2 - Gradient Boosting
if (notifyStatus) email_notify(paste("Algorithm #2 tuning has begun!",date()))
## [1] "Java-Object{org.apache.commons.mail.SimpleEmail@28d25987}"
startTimeModule <- proc.time()
set.seed(seedNum)
grid <- expand.grid(nrounds=c(100, 150, 200, 300), max_depth=3, eta=0.4, gamma=0, colsample_bytree=0.6, min_child_weight=1, subsample=1)
fit.final2 <- train(targetVar~., data=Xy_train, method="xgbTree", metric=metricTarget, tuneGrid=grid, trControl=control, verbose=F)
plot(fit.final2)

print(fit.final2)
## eXtreme Gradient Boosting 
## 
## 43890 samples
##    48 predictor
##    11 classes: '1', '2', '3', '4', '5', '6', '7', '8', '9', '10', '11' 
## 
## No pre-processing
## Resampling: Cross-Validated (10 fold, repeated 1 times) 
## Summary of sample sizes: 39501, 39501, 39501, 39501, 39501, 39501, ... 
## Resampling results across tuning parameters:
## 
##   nrounds  Accuracy   Kappa    
##   100      0.9988152  0.9986967
##   150      0.9989975  0.9988972
##   200      0.9989747  0.9988722
##   300      0.9989975  0.9988972
## 
## Tuning parameter 'max_depth' was held constant at a value of 3
## Tuning parameter 'eta' was held constant at a value of 0.4
## Tuning parameter 'gamma' was held constant at a value of 0
## Tuning parameter 'colsample_bytree' was held constant at a value of 0.6
## Tuning parameter 'min_child_weight' was held constant at a value of 1
## Tuning parameter 'subsample' was held constant at a value of 1
## Accuracy was used to select the optimal model using the largest value.
## The final values used for the model were nrounds = 150, max_depth = 3,
##  eta = 0.4, gamma = 0, colsample_bytree = 0.6, min_child_weight = 1
##  and subsample = 1.
proc.time()-startTimeModule
##     user   system  elapsed 
## 2062.975    3.140 1040.493
if (notifyStatus) email_notify(paste("Algorithm #2 tuning completed!",date()))
## [1] "Java-Object{org.apache.commons.mail.SimpleEmail@59f99ea}"

5.d) Compare Algorithms After Tuning

results <- resamples(list(RF=fit.final1, GBM=fit.final2))
summary(results)
## 
## Call:
## summary.resamples(object = results)
## 
## Models: RF, GBM 
## Number of resamples: 10 
## 
## Accuracy 
##          Min.   1st Qu.    Median      Mean   3rd Qu.      Max. NA's
## RF  0.9977216 0.9988608 0.9993165 0.9990203 0.9993165 0.9995443    0
## GBM 0.9979494 0.9987469 0.9990886 0.9989975 0.9993165 0.9995443    0
## 
## Kappa 
##          Min.   1st Qu.    Median      Mean   3rd Qu.      Max. NA's
## RF  0.9974937 0.9987469 0.9992481 0.9989223 0.9992481 0.9994987    0
## GBM 0.9977444 0.9986216 0.9989975 0.9988972 0.9992481 0.9994987    0
dotplot(results)

6. Finalize Model and Present Results

Once we have narrowed down to a model that we believe can make accurate predictions on unseen data, we are ready to finalize it. For this project, finalizing the model involves making predictions on the validation dataset, creating a standalone model on the entire training dataset, and saving the model for later use.

if (notifyStatus) email_notify(paste("Model Validation and Final Model Creation has begun!",date()))
## [1] "Java-Object{org.apache.commons.mail.SimpleEmail@5c3bd550}"

6.a) Predictions on validation dataset

predictions <- predict(fit.final1, newdata=Xy_test)
confusionMatrix(predictions, y_test)
## Confusion Matrix and Statistics
## 
##           Reference
## Prediction    1    2    3    4    5    6    7    8    9   10   11
##         1  1327    0    0    0    0    0    0    0    0    0    0
##         2     0 1327    0    0    0    0    0    0    4    2    0
##         3     0    0 1329    0    0    0    0    0    0    0    0
##         4     0    0    0 1329    0    0    0    0    0    0    0
##         5     0    0    0    0 1327    0    0    0    0    0    0
##         6     2    0    0    0    0 1328    0    0    0    0    0
##         7     0    0    0    0    0    0 1329    0    0    0    0
##         8     0    0    0    0    2    0    0 1329    1    0    0
##         9     0    0    0    0    0    1    0    0 1323    0    0
##         10    0    2    0    0    0    0    0    0    1 1327    0
##         11    0    0    0    0    0    0    0    0    0    0 1329
## 
## Overall Statistics
##                                           
##                Accuracy : 0.999           
##                  95% CI : (0.9983, 0.9994)
##     No Information Rate : 0.0909          
##     P-Value [Acc > NIR] : < 2.2e-16       
##                                           
##                   Kappa : 0.9989          
##                                           
##  Mcnemar's Test P-Value : NA              
## 
## Statistics by Class:
## 
##                      Class: 1 Class: 2 Class: 3 Class: 4 Class: 5 Class: 6
## Sensitivity           0.99850  0.99850  1.00000  1.00000  0.99850  0.99925
## Specificity           1.00000  0.99955  1.00000  1.00000  1.00000  0.99985
## Pos Pred Value        1.00000  0.99550  1.00000  1.00000  1.00000  0.99850
## Neg Pred Value        0.99985  0.99985  1.00000  1.00000  0.99985  0.99992
## Prevalence            0.09091  0.09091  0.09091  0.09091  0.09091  0.09091
## Detection Rate        0.09077  0.09077  0.09091  0.09091  0.09077  0.09084
## Detection Prevalence  0.09077  0.09118  0.09091  0.09091  0.09077  0.09098
## Balanced Accuracy     0.99925  0.99902  1.00000  1.00000  0.99925  0.99955
##                      Class: 7 Class: 8 Class: 9 Class: 10 Class: 11
## Sensitivity           1.00000  1.00000  0.99549   0.99850   1.00000
## Specificity           1.00000  0.99977  0.99992   0.99977   1.00000
## Pos Pred Value        1.00000  0.99775  0.99924   0.99774   1.00000
## Neg Pred Value        1.00000  1.00000  0.99955   0.99985   1.00000
## Prevalence            0.09091  0.09091  0.09091   0.09091   0.09091
## Detection Rate        0.09091  0.09091  0.09050   0.09077   0.09091
## Detection Prevalence  0.09091  0.09111  0.09057   0.09098   0.09091
## Balanced Accuracy     1.00000  0.99989  0.99771   0.99913   1.00000
predictions <- predict(fit.final2, newdata=Xy_test)
confusionMatrix(predictions, y_test)
## Confusion Matrix and Statistics
## 
##           Reference
## Prediction    1    2    3    4    5    6    7    8    9   10   11
##         1  1326    0    0    0    0    0    0    0    1    0    0
##         2     0 1327    0    0    0    0    0    0    1    1    0
##         3     0    0 1328    5    0    0    0    0    2    0    0
##         4     0    0    0 1324    0    0    0    0    0    0    0
##         5     0    0    0    0 1328    0    0    0    0    0    0
##         6     3    0    1    0    0 1329    0    0    0    0    0
##         7     0    0    0    0    0    0 1329    0    0    0    0
##         8     0    0    0    0    1    0    0 1329    0    0    0
##         9     0    0    0    0    0    0    0    0 1325    0    0
##         10    0    2    0    0    0    0    0    0    0 1328    0
##         11    0    0    0    0    0    0    0    0    0    0 1329
## 
## Overall Statistics
##                                           
##                Accuracy : 0.9988          
##                  95% CI : (0.9981, 0.9993)
##     No Information Rate : 0.0909          
##     P-Value [Acc > NIR] : < 2.2e-16       
##                                           
##                   Kappa : 0.9987          
##                                           
##  Mcnemar's Test P-Value : NA              
## 
## Statistics by Class:
## 
##                      Class: 1 Class: 2 Class: 3 Class: 4 Class: 5 Class: 6
## Sensitivity           0.99774  0.99850  0.99925  0.99624  0.99925  1.00000
## Specificity           0.99992  0.99985  0.99947  1.00000  1.00000  0.99970
## Pos Pred Value        0.99925  0.99850  0.99476  1.00000  1.00000  0.99700
## Neg Pred Value        0.99977  0.99985  0.99992  0.99962  0.99992  1.00000
## Prevalence            0.09091  0.09091  0.09091  0.09091  0.09091  0.09091
## Detection Rate        0.09070  0.09077  0.09084  0.09057  0.09084  0.09091
## Detection Prevalence  0.09077  0.09091  0.09132  0.09057  0.09084  0.09118
## Balanced Accuracy     0.99883  0.99917  0.99936  0.99812  0.99962  0.99985
##                      Class: 7 Class: 8 Class: 9 Class: 10 Class: 11
## Sensitivity           1.00000  1.00000  0.99699   0.99925   1.00000
## Specificity           1.00000  0.99992  1.00000   0.99985   1.00000
## Pos Pred Value        1.00000  0.99925  1.00000   0.99850   1.00000
## Neg Pred Value        1.00000  1.00000  0.99970   0.99992   1.00000
## Prevalence            0.09091  0.09091  0.09091   0.09091   0.09091
## Detection Rate        0.09091  0.09091  0.09064   0.09084   0.09091
## Detection Prevalence  0.09091  0.09098  0.09064   0.09098   0.09091
## Balanced Accuracy     1.00000  0.99996  0.99850   0.99955   1.00000

6.b) Create standalone model on entire training dataset

startTimeModule <- proc.time()
set.seed(seedNum)

# Combine the training and testing sets into the complete dataset used to train the final model
Xy_complete <- rbind(Xy_train, Xy_test)

# library(randomForest)
# finalModel <- randomForest(targetVar~., Xy_complete, mtry=3, na.action=na.omit)
# summary(finalModel)
proc.time()-startTimeModule
##    user  system elapsed 
##   0.041   0.002   0.043
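Because tuning selected mtry = 2, a standalone Random Forest trained on the complete dataset would presumably look like the sketch below (kept commented out, like the template code above, since it was not run in this iteration; the template's mtry=3 is replaced with the tuned value, and ntree stays at the randomForest default of 500):

# library(randomForest)
# finalModel <- randomForest(targetVar~., Xy_complete, mtry=2, na.action=na.omit)
# print(finalModel)  # OOB error estimate and per-class confusion matrix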

6.c) Save model for later use

#saveRDS(finalModel, "./finalModel_MultiClass.rds")
if (notifyStatus) email_notify(paste("Model Validation and Final Model Creation Completed!",date()))
## [1] "Java-Object{org.apache.commons.mail.SimpleEmail@39c0f4a}"
proc.time()-startTimeScript
##      user    system   elapsed 
## 50180.710   180.162 31956.413
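For later reuse, the saved model could be restored and applied without retraining; a minimal sketch (also left commented, following the saveRDS() call in Section 6.c):

# Restore the finalized model and score new observations
# finalModel <- readRDS("./finalModel_MultiClass.rds")
# predictions <- predict(finalModel, newdata=Xy_test)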